Zürich Statistical Office collects data on city and its residents. This data is published as Linked Data.
In this tutorial, we will show how to work with Linked Data. Mainly, we will see how to work with real estate dataset.
We will look into how to query, process, and visualize it.
Data on real estate market is published as Linked Data. It can be accessed with SPARQL queries.
You can send queries using HTTP requests. The API endpoint is https://ld.zazuko.com/query/.
Let's use SparqlClient from graphly to communicate with the database.
Graphly will allow us to:
pandas or geopandas# Installing dependencies for Colab environment
!pip install git+https://github.com/zazuko/graphly.git
import datetime
import re
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from graphly.api_client import SparqlClient
ENDPOINT = "https://ld.zazuko.com/query/"
sparql = SparqlClient(ENDPOINT)
sparql.add_prefixes({
"schema": "<http://schema.org/>",
"cube": "<https://cube.link/>",
"property": "<https://ld.stadt-zuerich.ch/statistics/property/>",
"measure": "<https://ld.stadt-zuerich.ch/statistics/measure/>",
"skos": "<http://www.w3.org/2004/02/skos/core#>",
"ssz": "<https://ld.stadt-zuerich.ch/statistics/>"
})
SPARQL queries can become very long. To improve the readibility, we will work wih prefixes.
Using add_prefixes method, we define persistent prefixes.
Every time you send a query, graphly will add automatically update the prefixes for you.
Let's find the average price per m2 for an apartment in Zurich. This data is available in QMP-EIG-HAA-OBJ-ZIM data cube. It will allow us to find the price per city district and apartment type. Thea data is also available at different points in time.
The query for housing prices in city of Zurich for different districts and apartment types over time is:
query = """
SELECT ?time ?place ?rooms ?price
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:QMP-EIG-HAA-OBJ-ZIM a cube:Cube;
cube:observationSet/cube:observation ?observation.
?observation property:TIME ?time ;
property:RAUM ?place_uri;
property:ZIM/schema:name ?rooms;
measure:QMP ?price .
?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
schema:name ?place .
FILTER regex(str(?place),"ab|Stadtgebiet vor")
FILTER (?price > 0)
}
ORDER BY ?time
"""
df = sparql.send_query(query)
df.head()
| time | place | rooms | price | |
|---|---|---|---|---|
| 0 | 2009-12-31 | Kreis 9 (ab 1934) | 3- und 3,5-Zimmer Wohnung | 9012.0 |
| 1 | 2009-12-31 | Kreis 10 (ab 1934) | 3- und 3,5-Zimmer Wohnung | 8796.0 |
| 2 | 2009-12-31 | Kreis 7 (ab 1934) | 3- und 3,5-Zimmer Wohnung | 10607.0 |
| 3 | 2009-12-31 | Kreis 11 (ab 1970) | 3- und 3,5-Zimmer Wohnung | 5883.0 |
| 4 | 2009-12-31 | Kreis 10 (ab 1934) | 7- und 7,5-Zimmer Wohnung | 8404.0 |
Let's visualize housing prices per apartment type. To do this, we will aggregate the prices per rooms.
The cleaned dataframe becomes:
df.place = df.place.apply(lambda x: re.findall('Kreis \d+', x)[0])
df.rooms = df.rooms.apply(lambda x: int(re.findall('\d+', x)[0]))
plot_df = df[["rooms", "price"]][df.time == df.time.max()].groupby(["rooms"]).mean().astype(int).sort_values(by="rooms").reset_index()
plot_df
| rooms | price | |
|---|---|---|
| 0 | 1 | 12706 |
| 1 | 2 | 12841 |
| 2 | 3 | 12618 |
| 3 | 4 | 12138 |
| 4 | 5 | 13792 |
| 5 | 6 | 13361 |
| 6 | 7 | 12146 |
| 7 | 8 | 20804 |
fig = px.bar(plot_df, x="rooms", y="price")
fig.update_layout(
title='Housing prices in Zürich',
title_x=0.5,
yaxis_title="CHF per m<sup>2</sup>",
xaxis_title="rooms"
)
fig.show()
Let's visualize housing prices per district. To do this, we will aggregate the prices per place.
The cleaned dataframe becomes:
plot_df = df[["place", "price"]][df.time == df.time.max()].groupby(["place"]).mean().astype(int).sort_values(by="price").reset_index()
plot_df
| place | price | |
|---|---|---|
| 0 | Kreis 12 | 8748 |
| 1 | Kreis 9 | 9995 |
| 2 | Kreis 4 | 10620 |
| 3 | Kreis 11 | 11051 |
| 4 | Kreis 10 | 11281 |
| 5 | Kreis 3 | 12380 |
| 6 | Kreis 1 | 13466 |
| 7 | Kreis 6 | 13818 |
| 8 | Kreis 5 | 14176 |
| 9 | Kreis 2 | 14624 |
| 10 | Kreis 7 | 15970 |
| 11 | Kreis 8 | 17354 |
fig = px.bar(plot_df, x="place", y="price")
fig.update_layout(
title='Housing prices in Zürich',
title_x=0.5,
yaxis_title="CHF per m<sup>2</sup>",
xaxis_title="district",
)
fig.show()
Now, let's combine both views. We will visualize housing prices across districts and apartment types.
The plotting dataframe becomes:
plot_df = df[["place", "rooms", "price"]][df.time == df.time.max()]
plot_df = plot_df[plot_df.rooms <= 4]
plot_df["avg_price"] = plot_df[["price", "place"]].groupby(["place"]).transform(lambda x: sum(x)/len(x))
plot_df = plot_df.sort_values(by=["avg_price", "rooms"])
plot_df.head()
| place | rooms | price | avg_price | |
|---|---|---|---|---|
| 548 | Kreis 12 | 1 | 7826.0 | 8748.50 |
| 525 | Kreis 12 | 2 | 8404.0 | 8748.50 |
| 565 | Kreis 12 | 3 | 8987.0 | 8748.50 |
| 557 | Kreis 12 | 4 | 9777.0 | 8748.50 |
| 543 | Kreis 4 | 1 | 9955.0 | 10620.75 |
fig = make_subplots(rows=3, cols=4, subplot_titles=plot_df["place"].unique(), shared_yaxes=True, y_title='CHF per m<sup>2</sup>', x_title='rooms', vertical_spacing=0.1)
for i, district in enumerate(plot_df["place"].unique()):
row = i//4 + 1
col = i%4 + 1
subset = plot_df[plot_df["place"] == district]
fig.append_trace(go.Bar(
x=subset["rooms"],
y=subset["price"],
name=district,
marker_color=px.colors.qualitative.Dark24[0]
), row=row, col=col)
fig.update_layout(height=800, width=1000, title={"text": "Housing prices in Zürich", "x": 0.5}, showlegend=False)
fig.update_yaxes(range=[0,20000])
fig.show()
Let's take a look at averge housing prices over time. We will aggregate te results by time.
The plotting dataframe becomes:
plot_df = df[["time", "price"]].groupby(["time"]).mean().reset_index()
plot_df.head()
| time | price | |
|---|---|---|
| 0 | 2009-12-31 | 8702.231884 |
| 1 | 2010-12-31 | 9206.983871 |
| 2 | 2011-12-31 | 10061.363636 |
| 3 | 2012-12-31 | 11442.333333 |
| 4 | 2013-12-31 | 11096.770492 |
fig = px.line(plot_df, x='time', y="price")
fig.update_layout(
title='Housing prices in Zürich',
title_x=0.5,
yaxis_title="CHF per m<sup>2</sup>",
)
fig.show()
Now that we have seen the trend, let's dig a bit deeper. We will visualize housing prices over time for 2, 4and 6-rooms apartments. To do it, we need to reshape our dataframe to show prices per apartment type over time.
The reshaped dataframe becomes:
plot_df = pd.pivot_table(df, index="time", columns="rooms", values="price", aggfunc=np.mean).reset_index()
plot_df.head()
| rooms | time | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-12-31 | 8262.818182 | 8375.750000 | 8570.636364 | 8296.636364 | 9629.4 | 9994.666667 | 7241.25 | 9571.75 |
| 1 | 2010-12-31 | 9363.800000 | 8776.272727 | 9397.666667 | 8553.818182 | 8574.4 | 9704.200000 | 10356.00 | 18815.00 |
| 2 | 2011-12-31 | 8950.727273 | 12086.166667 | 8647.166667 | 9840.833333 | 9371.1 | 10297.000000 | 12704.00 | 14049.00 |
| 3 | 2012-12-31 | 10207.000000 | 10508.916667 | 10583.000000 | 11544.909091 | 11655.4 | 13159.857143 | 18831.00 | 13346.00 |
| 4 | 2013-12-31 | 10805.571429 | 10163.181818 | 11118.000000 | 11710.750000 | 11023.0 | 12801.800000 | 11085.00 | 8092.00 |
fig = make_subplots(rows=3, cols=1, y_title='CHF per m<sup>2</sup>')
for i, j in enumerate([2,4,6]):
fig.append_trace(go.Scatter(
x=plot_df["time"],
y=plot_df[j],
name="Rooms: {}".format(j),
marker_color=px.colors.qualitative.Dark24[i]
), row=i+1, col=1)
fig.update_layout(title={"text": "Housing prices in Zürich", "x": 0.5}, showlegend=True)
fig.update_yaxes(range=[8000,14000])
fig.show()
The real estate prices are influenced by:
Let's take a look at how those numbers it evolved over time. Mainly, we want find the apartments and population count in city of Zurich.
The query for number of inhabitants and apartments over time is:
query = """
SELECT *
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE{
{
SELECT ?time (SUM(?pop_count) AS ?pop)
WHERE {
ssz:BEW a cube:Cube;
cube:observationSet/cube:observation ?obs_bew.
?obs_bew property:TIME ?time ;
property:RAUM ?place_uri_pop;
measure:BEW ?pop_count .
?place_uri_pop skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
schema:name ?place_pop .
FILTER regex(str(?place_pop),"ab|Stadtgebiet vor")
}
GROUP BY ?time
}
{
SELECT ?time (SUM(?apt_count) AS ?apts)
WHERE {
ssz:WHG a cube:Cube;
cube:observationSet/cube:observation ?obs_apt.
?obs_apt property:TIME ?time ;
property:RAUM ?place_uri_apt;
measure:WHG ?apt_count .
?place_uri_apt skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
schema:name ?place .
FILTER regex(str(?place),"ab|Stadtgebiet vor")
}
GROUP BY ?time
}
}
ORDER BY ?time
"""
df = sparql.send_query(query)
df.head()
| time | pop | apts | |
|---|---|---|---|
| 0 | 1896-12-31 | 39142.0 | 22334.0 |
| 1 | 1897-12-31 | 39553.0 | 23776.0 |
| 2 | 1898-12-31 | 39728.0 | 24811.0 |
| 3 | 1899-12-31 | 39623.0 | 25565.0 |
| 4 | 1900-12-31 | 39388.0 | 26138.0 |
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
go.Scatter(x=df["time"], y=df["pop"], name="Population"),
secondary_y=False,
)
fig.add_trace(
go.Scatter(x=df["time"], y=df["apts"], name="Apartments"),
secondary_y=True,
)
fig.update_layout(title={"text": "Population and apartments in Zürich", "x": 0.5})
fig.update_yaxes(title_text="population", secondary_y=False)
fig.update_yaxes(title_text="apartments", secondary_y=True)
fig.show()
Let's take a look at what apartments were, and are available in Zurich. How did the share of various apparmnent types evolve over time?
Is there a trend to build more smaller (or bigger) apartments?
These insights are available in WHG-ZIM data cube.
The query for number of different apartment types over time is:
query = """
SELECT ?time ?rooms (SUM(?count) AS ?apts)
FROM <https://lindas.admin.ch/stadtzuerich/stat>
WHERE {
ssz:WHG-ZIM a cube:Cube;
cube:observationSet/cube:observation ?obs.
?obs property:TIME ?time ;
property:RAUM ?place_uri;
property:ZIM/schema:name ?rooms ;
measure:WHG ?count .
?place_uri skos:inScheme <https://ld.stadt-zuerich.ch/statistics/scheme/Kreis> ;
schema:name ?place .
FILTER regex(str(?place),"ab|Stadtgebiet vor")
FILTER (?time >= "1977-01-01"^^xsd:time)
}
GROUP BY ?time ?rooms
ORDER BY ?time ?rooms
"""
df = sparql.send_query(query)
df.head()
| time | rooms | apts | |
|---|---|---|---|
| 0 | 1977-12-31 | 1- und 1,5-Zimmer Wohnung | 26132.0 |
| 1 | 1977-12-31 | 2- und 2,5-Zimmer Wohnung | 34686.0 |
| 2 | 1977-12-31 | 3- und 3,5-Zimmer Wohnung | 68026.0 |
| 3 | 1977-12-31 | 4- und 4,5-Zimmer Wohnung | 33927.0 |
| 4 | 1977-12-31 | 5- und 5,5-Zimmer Wohnung | 7593.0 |
Let's reshape our dataframe to show the apartment count per apartment type, over time:
def rename_rooms(x: str) -> str:
elements = x.split("- und ")
if elements[1] == "mehr-Zimmer Wohnung":
return elements[0] + "+"
else:
return elements [0]
df.rooms = df.rooms.apply(rename_rooms)
df = pd.pivot_table(df, index="time", columns="rooms", values="apts")
# Data cleaning
df["5+"][df["5+"].isna()] = df["5"][df["5+"].isna()] + df["6+"][df["5+"].isna()]
df = df[["1", "2", "3", "4", "5+"]].reset_index().rename_axis(None, axis=1)
df = df.fillna(method="ffill")
df.head()
| time | 1 | 2 | 3 | 4 | 5+ | |
|---|---|---|---|---|---|---|
| 0 | 1977-12-31 | 26132.0 | 34686.0 | 68026.0 | 33927.0 | 12728.0 |
| 1 | 1978-12-31 | 26280.0 | 34963.0 | 68080.0 | 34066.0 | 12706.0 |
| 2 | 1979-12-31 | 27536.0 | 35324.0 | 68139.0 | 34147.0 | 12696.0 |
| 3 | 1980-12-31 | 28039.0 | 35717.0 | 68465.0 | 34328.0 | 12706.0 |
| 4 | 1981-12-31 | 27398.0 | 36103.0 | 68874.0 | 34851.0 | 12790.0 |
cols = ["1", "2", "3", "4", "5+"]
start = df[cols].iloc[0] / sum(df[cols].iloc[0])
end = df[cols].iloc[-1] / sum(df[cols].iloc[-1])
diff = abs(end-start)
cols = [x for _, x in sorted(zip(diff, cols))]
fig = px.histogram(df, x="time", y=cols, barnorm="percent")
fig.update_layout(
title='Apartments in Zürich',
title_x=0.5,
yaxis_title="% of apartments",
legend_title="Rooms"
)
fig.show()